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Abstract 

Phylogenetic networks are necessary to represent the tree of life expanded by edges 
to represent events such as horizontal gene transfers, hybridizations or gene flow. Not 
all species follow the paradigm of vertical inheritance of their genetic material. While 
a great deal of research has flourished into the inference of phylogenetic trees, statis¬ 
tical methods to infer phylogenetic networks are still limited and under development. 

The main disadvantage of existing methods is a lack of scalability. Here, we present 
a statistical method to infer phylogenetic networks from multi-locus genetic data in 
a pseudolikelihood framework. Our model accounts for incomplete lineage sorting 
through the coalescent model, and for horizontal inheritance of genes through reticu¬ 
lation nodes in the network. Computation of the pseudolikelihood is fast and simple, 
and it avoids the burdensome calculation of the full likelihood which can be intractable 
with many species. Moreover, estimation at the quartet-level has the added computa¬ 
tional benefit that it is easily parallelizable. Simulation studies comparing our method 
to a full likelihood approach show that our pseudolikelihood approach is much faster 
without compromising accuracy. We applied our method to reconstruct the evolution¬ 
ary relationships among swordtails and platyflshes {Xiphophorus: Poeciliidae), which 
is characterized by widespread hybridizations. 

Keywords: reticulate evolution, hybridization, coalescent model, incomplete lineage 
sorting 

Evolutionary relationships are typically visualized in a tree, which implicitly assumes 
vertical transfer of genetic material from ancestors to descendants. However, not all species 
follow this paradigm. If genes can be horizontally transferred between some organisms, a tree 
is not a good representation of their history. Such reticulate events include hybridization, 
horizontal gene transfer or migration with gene flow, and require methods to infer phyloge¬ 
netic networks. While a great deal of research has flourished for the inference of phylogenetic 
trees from different types of data, methods to infer phylogenetic networks are still limited 
and under development. 
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There are mainly two kinds of phylogenetic networks: implicit and explicit. Implicit net¬ 
works -also called split networks- describe the discrepancy in gene trees, or other sources of 
data, and methods are well developed to reconstruct these networks [DEIEIII]. These meth¬ 
ods tend to be fast. However, implicit networks lack biological interpretation as the internal 
nodes do not represent ancestral species. Explicit networks, on the other hand, represent 
explicit reticulation events and each node represents an ancestral species. Combinatorial 
methods to infer explicit networks (which we call phylogenetic networks here) are fast but 
ignore gene tree error and incomplete lineage sorting (ITS) as a possible source of gene tree 
discordance (e.g. 0 )- Model-based methods are most accurate but can be computationally 
challenging. They calculate the likelihood of an observed gene tree given a species network 
taking into account both reticulation and ITS [HI [6l [7] . Their scope was expanded in [9] to 
search for the most likely phylogenetic network based on multi-locus data (see also [lO] for a 
different likelihood framework, where sites instead of genes are treated as independent and 
ITS is ignored). The likelihood-based method in [9], implemented in PhyloNet, provides a 
solid theoretical framework to estimate the maximum likelihood phylogenetic network from 
a set of gene trees. It has several advantages: it incorporates uncertainty on the gene trees 
estimated from sequence data, accounts for a background level of gene tree discordance due 
to ILS, and controls the complexity of the network with a cross validation step. However, 
its likelihood computation is heavy and becomes intractable when increasing the number of 
taxa or the number of hybridizations, making this method practical for small scenarios of 
up to about 10 species and 4 hybridizations in the network. 

Here, we provide a fast statistical method to estimate phylogenetic networks from multi¬ 
locus data. We first present the theory for the pseudolikelihood of a network. We do so 
by deriving the proportion of the genome that has each 4-taxon tree (quartet concordance 
factors) as expected under the coalescent model extended by hybridization events, and we 
prove the generic identihability of the model. We then use the observed quartet concordance 
factors as inferred from the multi-locus data to estimate the species network. Our method 
SNaQ (Species Networks applying Quartets) is implemented in our open-source software 
package PhyloNetworks in Julia and publicly available at https://github.com/crsl4. 

Like PhyloNet, our method can incorporate uncertainty in estimated gene trees and gene 
tree discordance due to ILS. Our pseudolikelihood has computational advantages. It is sim¬ 
pler and more scalable to many species, compared to the full likelihood. It also scales to 
a large number of loci because estimation of gene trees can be highly parallelized, then 
summarized by only 3 tree frequencies on each 4-taxon subsets used as input in the pseudo- 
likelihood. In simulations, our method showed good performance and scaled to scenarios for 
which PhyloNet could not run. We also used SNaQ to infer the evolutionary relationships 
between Xiphophorus fishes, from 1,183 loci across 24 taxa. Our results were congruent with 
m and rehned the placement of some hybridizations found in that study. The analyses 
here presented show that SNaQ can enable scientists to incorporate organisms to the “tree 
of life” in parts that are more net-like than tree-like, and thus, complete a broader picture 
of evolution. 
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Models 


Phylogenetic networks 

Intuitively, a phylogenetic network is a phylogenetic tree with added hybrid edges, causing 
some nodes to have two parents (but see [I2])- Phylogenetic networks can describe various 
biological processes causing gene flow from one population to another such as hybridization, 
introgression, or horizontal gene transfer. Hybridization occurs when individuals from 2 
genetically distinct populations interbreed, resulting in a new separate population. Intro¬ 
gression, or introgressive hybridization, is the integration of alleles from one population into 
another existing population, through hybridization and backcrossing. Genes are horizon¬ 
tally transferred when acquired by a population through a process other than reproduction, 
from a possibly distantly related population. Although these three processes are biologically 
different, we do not make the distinction when modeling them with a network. In other 
words, our model takes into account all three biological scenarios, but those scenarios are 
not distinguishable in the estimated phylogenetic network unless more biological information 
is provided. 

Just like phylogenetic trees, networks can be rooted or unrooted. A rooted phylogenetic 
network on taxon set X is a connected directed acyclic graph with vertices V = {r} U JA U 
Vff U Vt, edges E = Eh U Et and a bijective leaf-labeling function f : Vl ^ X with the 
following characteristics. The root r has indegree 0 and outdegree 2. Any leaf n G lA has 
indegree 1 and outdegree 0. Any tree node v E Vt has indegree 1 and outdegree 2. Any 
hybrid node v eVh has indegree 2 and outdegree 1. A tree edge e G Et is an edge whose 
child is a tree node. A hybrid edge e G Eh is an edge whose child is a hybrid node. Unrooted 
phylogenetic networks are typically obtained by suppressing the root node and the direction 
of all edges. We also consider semi-directed unrooted networks, where the root node is 
suppressed and we ignore the direction of all tree edges, but we maintain the direction of 
hybrid edges, thus keeping information on which nodes are hybrids. The placement of the 
root is then constrained, because the direction of the two hybrid edges to a given hybrid 
node inform the direction of time at this node: the third edge must be a tree edge directed 
away from the hybrid node and leading to all the hybrid’s descendants. Therefore the root 
cannot be placed on any descendant of any hybrid node, although it might be placed on 
some hybrid edges. 

We further assume that the true network is of level-1 [1], i.e. any given edge can be 
part of at most one cycle. This means that there is no overlap between any two cycles (but 
see the Discussion). Refer to PP for other types of evolutionary networks. Throughout this 
work, we denote by 

• n the number of taxa, 

• h the number of hybridization events and 

• ki the number of nodes in the undirected cycle created by the ith hybrid node. 
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Figure 1: Example of rooted and semi-directed phylogenetic networks with h = 2 
hybridization events and n = 7 sampled taxa. Inheritance probabilities 7 represent 
the proportion of genes contributed by each parental population to a given hybrid node. 
Left: rooted network modelling several biological processes. Taxon F is a hybrid between 
two non-sampled taxa Y and Z with 72 ~ 0.50, and the lineage ancestral to taxa C and 
D has received genes introgressed from a non-sampled taxon X, for which 71 0.10. An 

alternative process at this event could be the horizontal transfer of only a handful of genes, 
corresponding to a very small fraction 71 0.001. Center: semi-directed network for the 

biological scenario just described. Although the root location is unknown, its position is 
constrained by the direction of hybrid edges (directed by arrows). For example, C, G or 
E cannot be outgroups. Right: rooted network obtained from the semi-directed network 
(center) by placing the root on the hybrid edge that leads to taxon F (labeled by 1 — 72 ). 

For example, in Fig. [T] (center) n = 7,h = 2, ki = 3 and /c 2 = 4. The main parameter of 
interest is the topology J\f of the semi-directed network. Like phylogenetic trees, this network 
can be rooted by a known outgroup. The other parameters of interest are t, the vector of 
branch lengths in coalescent units (see below), and a vector of inheritance probabilities 7 , 
describing the proportion of genes inherited by a hybrid node from one of its hybrid parent 
(see Fig. [T]). Only identihable branch lengths are considered in t. For example, with only 
one sequenced individual per taxon, the lengths of external edges are not identihable and 
are not estimated. 

Pseudolikelihood on a network. Pseudolikelihood has already been used to estimate 
phylogenetic trees under ILS [13], and here we extend the theory to networks. The pseudo- 
likelihood of a network is based on the likelihood formulas of its 4-taxon subnetworks. These 
4-taxon likelihoods are not independent but fast to compute. A quartet is a 4-taxon un¬ 
rooted tree. For taxon set s = {a, b, c, d}, there are only three possible quartets, represented 
by the splits qi = ab\cd, q 2 = ac\bd and gs = ad\bc. 

The concordance factor (CF) of a given quartet (or split) is the proportion of genes 
whose true tree displays that quartet (or split) [Tl|. We use the term ‘CF’ as opposed 
to ‘probability’ to emphasize that CFs measure genomic support. Probabilities (such as 
posterior probabilities or bootstrap values) are most often thought to measure statistical 
uncertainty [Uj. Intuitively, splits between natural evolutionary groups of organisms are 
recovered by most or all genes, and thus have high CFs. On the other hand, the presence of 
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a hybrid would be captured by intermediate CFs. For example, if a is a hybrid intermediate 
between b and c, the CFs of ab\cd and ac\bd would be around 0.5 while the CF of ad\bc would 
be near 0 . 

The theoretical CFs (CFg^, CF^j, CF^g) expected under the coalescent model are already 
known if the network is a species tree [16]. Interestingly, these CFs are independent of the 
root placement in the species tree, and are given by (1—2/3e“*, l/3e“*, l/3e“*) if the unrooted 
species tree is qi = ab\cd with an internal edge of length t coalescent units. On a species 
network with reticulations, the probabilities of rooted gene trees was fully derived in [S] and 
more efficiently in [9]. But the probabilities of unrooted gene trees was not determined. We 
derive the quartet probabilities in the next section. In particular, they do not depend on the 
root placement in the network, which makes them simple and fast to compute. 

To calculate the likelihood of a 4-taxon network from gene trees Q = {Gi, G 2 , ..., Gg} at g 
loci, we consider the number of gene trees X = {Xg^, Xg^, Xg^) that match each of the three 
quartets. Assuming unlinked loci, X follows a multinomial distribution with probabilities 
(CFqj, CFq 2 , CFqg), the quartet CFs expected under the coalescent on the 4-taxon network. 
With a larger network on n > 4 taxa, we consider all 4-taxon subsets s and combine the 
likelihood of each 4-taxon subnetworks to form the full network pseudolikelihood: 

L = n(CF«.)^"‘(CF,.)*”(CF,3)^» (1) 

s^S 

where S is the collection of all 4-taxon sets and qi = qi{s) {i = 1, 2, 3) are the 3 quartet trees 
on s. In ([1]), the data are summarized in the X values, and the candidate network governs 
the CF values, which we derive below. 

Quartet CF for a 4-taxon network nnder ILS. For h = 1 hybridization, there are 
5 different semi-directed 4-taxon networks (up to tip re-labelling). We describe here the 
expected quartet CFs (probabilities) for only one case and refer to SI Text for the remaining 
cases. 

Under the hybridization model described in [ 6 l [ 8 ] and the network in Fig. [2] (left), each 
gene from taxon G has probability 7 of having descended from the hybridization edge sister 
to D, and probability 1 — 7 of having descended from the original tree branch, sister to {AB). 
Therefore, the expected CFs are weighted averages of CFs obtained on 2 species tree with 
ILS. Because the quartet probabilities do not depend on the root placement in each species 
tree, they do not depend on the root placement in the original network either. Fig. [2] (top 
right) shows the corresponding semi-directed network, and all rooted networks displaying it 
share the same quartet CFs, obtained from the coalescent models on the 2 unrooted species 
trees shown in Fig. [2] (bottom right). These trees have the same topology but different branch 
lengths in this case. Therefore we get that CFafe|cd = (1 — 7 ) (1 — 2/3 e"*^) - 1 - 7(1 — 2/3 
and the other 2 quartets occur with equal probabilities: CF^cIm = CFa^itc = ( 1 — 7 )(l/ 3 )e“*i-|- 

7(l/3)e-*i-*C 

On other semi-directed networks, more than 2 underlying unrooted species trees are 
needed if a hybrid node has more than one descendent taxon. In the network in Fig. [3] 
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Figure 2; Rooted network (left) and its semi-directed version (top right). Quartet 
CFs expected under the network do not depend on the root placement, and are weighted 
averages of quartet CFs expected under the unrooted trees (bottom right). 


for instance, the hybrid node has two descendants, A and B. Given this network, the 
computation of the CF for the major quartet AB\CD is as follows. First, A and B can 
coalesce along the branch of length ti with probability 1 — If they do not coalesce 

(with probability then there are 3 options: both originated from the minor hybrid edge 
(probability 7 each); both originated from the major hybrid edge (each with probability 
1 — 7); or one {A or B) originated from the minor hybrid edge but the other [B or A) 
from the major. Assuming that each lineage’s origin is independent of the other, we get 
CF ab\cd = l-e-*i+e-'i((l-7)2(l-2/3e-*2-‘Q+27(l-7)(l-2/3e-‘Q+72(l-2/3e-*4-*Q). 
Therefore, this CF is a weighted average of CFs from the 4 species trees shown in Fig. [3] 
(see SI Text for all other cases). 

With more than 1 hybridization {h > 1) there are an inhnite number of semi-directed 
4-taxon networks, but we can still calculate the quartet CFs if we assume that the cycles 
created by different reticulations do not share edges. We do so recursively on h, by reducing 
each network to an equivalent network with h = 0 or 1 and transformed branch lengths. 
For example, the network in Fig. [2] leads to equal CFs of the 2 minor quartets ac\bd and 
ad\bc, so it is equivalent to the unrooted species tree ab\cd with internal branch length 
ts = — log((l — 7 )e“^i -1- 76 “*^“*^) to ensure 1/3 = CFqcIm given above. This new species 

tree and the original network have the same expected quartet CFs. The assumption of 
a level-1 network guarantees non-overlapping reticulation cycles, such that we can hnd an 
equivalent 4-taxon network with h = 0 or 1 and the same expected quartet CFs. We then 
just apply the network formulas above. The transformed branch lengths of the equivalent 
network are given in the SI Text. 

Detecting the presence of a hybridization. Identihability is a basic requirement if one 
seeks to learn about parameters from data. Here our parameters are the network topology 
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Figure 3: Example of a 4-taxon semi-directed network (left), with known direction 
of both hybrid edges but unspecified position of the root. The root can be placed on 
the internal edges with length ^ 2 , h, ^ 4 , or on the external edges to C or D. The quartet CFs 
on this network are weighted averages of CFs under 4 trees with weights as shown (right). 


J\f, branch lengths t and inheritance values 7 . We already know that quartet CFs do not 
depend on the root placement, so the rooted network is not identifiable and we only consider 
semi-directed networks J\f here. Our pseudolikelihood model would be identifiable if two 
different combinations of parameters (W, t, 7 ) and (A/"', t', 7 ') yield different sets of quartet 
CFs. We show here and below that some reticulations and some parameters are impossible 
(or hard) to detect. This theory is used later to reduce the parameter space explored by our 
heuristic search, to avoid network and parameter combinations that are not identifiable. 

On n = 4 taxa, we already showed that the network in Fig. [2] is equivalent to a tree 
with some appropriate internal branch length. In fact, the same holds true for all 4-taxon 
networks with /c = 2 or 3 nodes in their reticulation cycle: these reticulations cannot be 
detected. If /c = 4 i.e. if the reticulation involves more distantly related taxa, then the 
presence of the hybridization can be detected based on the quartet CFs. However, networks 
with the same unrooted topology are unidentifiable from each other from only 4 taxa, like 
the 2 networks in Fig. 0] if only Di is sampled (n = 4). They only differ in the placement 
of the hybrid node, which is therefore not identifiable, even if the unrooted network and the 
presence of the reticulation is. 

In general, for networks with n > 4 taxa, we restrict our focus to the case when J\f' 
is the network topology obtained from J\f by removing a single hybrid edge of interest. 
The assumption of non-intersecting cycles allows us to study the detectability of this one 
hybrid edge given the other hybridizations in the network (see SI Text). Assuming that 
all (”) 4-taxon sets are used in the pseudolikelihood, the network Af gives us 3(”) quartet 
CFs expected under the coalescent. The presence of the hybridization of interest can be 
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Figure 4: Networks with k = A nodes in the reticulation cycle and identical un¬ 
rooted topologies. They differ in their hybrid position (left: good diamond, right; bad 
diamond I). If D 2 is not sampled (n = 4), only 7 j(l — for i = 1,2 are identifiable and 
the 2 networks are not distinguishable from each other. 


detected if the quartet CFs from (TV, t, 7 ) cannot all be equal to the quartet CFs from 
(A/”', t' , 7 ') simultaneously. We matched both systems from M and M' using Macaulay 2 [T7] . 
and checked the values of (^, 7 ) and (i', 7 ') when the two systems of CFs were equal (see 
SI Text for full details). Apart from the obvious case 7 = 0 for the hybrid edge absent in 
M', we found that M and M' were not distinguishable when = D 01 % = 00 for some tree 
branches b, implying either a hard polytomy or a branch with no ILS and a reduction of the 
problem to a 4-taxon network. We can ignore these cases with the reasonable assumption 
Al: t G (0, 00 ) for all tree branches and 7 G (0,1). 

A1 is not sufficient, however, to ensure that the presence of each hybridization in J\f can 
be detected. Increasing taxon sampling helps detect a hybridization only if the added taxa 
increase the size of the reticulation cycle. Namely, if the cycle only involves k = 2 nodes 
(see Fig. [S3]), then Af is not distinguishable from J\f', regardless of n. For k = 3, some 
hybridizations are detectable and some are not. If any two of the three subtrees defined 
by the hybridization cycle (Fig. 15^ have only one taxon, then the hybridization is not 
detectable. It is, if instead at least two subtrees contain more than one taxon. In general, 
hybridizations with k > A can be detected if n > 5. Here and below, we use the terms 
detectable or identifiable in their generic sense [T 8 l [T9] , which simply means that some 
conditions on (t, 7 ) are required, like Al, but that all these conditions are met except on a 
subset of measure zero. 

We further determined if the direction of a given hybrid edge was identifiable (in addition 
to its presence) when n = 5 and A: = 4, in a case when the direction is not identifiable from 
4 taxa. Fig. [4] shows two networks that differ only in the placement of the hybrid node, 
but otherwise have the same unrooted topology. We proved that they yield different sets 
of quartet probabilities and therefore are distinguishable from each other, showing that the 
direction of the hybridization becomes identifiable when n > 5. 
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Figure 5: Networks with k nodes in a hybridization cycle: k = 2,3,4 and 5 from 
left to right. When k = 3, parameters are not identifiable. A good triangle corresponds to 
ni,n 2 ,'n .3 > 2, in which case setting tu = 0 makes the other parameters identifiable. When 
/c = 4, parameters are not all identifiable for the bad diamond I (no = n 2 = = 1 but 

Til > 2) and for the bad diamond II (no = ni = n 2 = 1 but > 2). 


Identifiability of branch lengths and heritabilities. We now turn to the case when 
J\f' = J\f to determine if t and 7 are identifiable given a known network topology. Like 
before, we used Macaulay2 to determine under which conditions two different combinations 
of parameters (t, 7 ) and {f, 7 ') yield different sets of quartet probabilities for a fixed network 
M (see SI Text for details). 

Just as before, the identifiability depends on the type of network (Fig. [S3|) . With only 
4 taxa, there are more parameters than equations (3 quartet CFs), so t and 7 are not 
separately identifiable, so we focus first on the case with n > 5. 

If n > 5, parameter identifiability is again easier if the reticulation involves more distantly 
related taxa. If k>5, all the parameters are identifiable. If fc < 3, parameters are not 
identifiable. If fc = 4, parameters are identifiable if either Uq > 2 (or 77 , 2 , symmetrically), or 
if both rii and > 2 (see Fig. [S3]). We call this a good diamond. Parameters are not all 
identifiable in the remaining 2 cases, which we call bad diamonds I and II (see Fig. IS3I) . The 
bad diamond I already lacked identifiability under a different model in |2nj . 

Practical consequences. A naive search for the most likely network would get stuck 
alternating between non-distinguishable networks or parameter sets. Hence we reduced the 
searchable space to only consider networks whose reticulations involve enough nodes. Indeed, 
all reticulations with k = 2 and most with k = 3 are either not detectable at all, or their 
parameters are not all identifiable. For hybridizations with k = 3, we only kept those 
with Hi > 2 for all i = 0,1,2 (see Fig. and we enforced fi 2 = 0 to make the other 
6 parameters identifiable. We denote this case as a good triangle. For bad diamonds I 
{k = 4), we reparametrized the 3 non-identifable values ( 7 ,^ 1 ,to) into 2 identifiable ones 
( 7(1 — e“*°), (1 — 7)(1 — e“*i)) (see SI Text). For bad diamonds II, we set tis = 0 and kept 
the other 5 parameters ( 7 , to,^ 2 ,^ 3 )- 
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Network estimation procedure 

The input for our method is a table of quartet CFs observed from multi-locus data (the X 
values in ([1])), across many or all 4-taxon subsets from the n taxa of interest. 

Pseudolikelihood optimization. The maximum pseudolikelihood (MPL) estimate is the 
network, branch lengths t and 7 heritabilities that maximize the pseudolikelihood ([T]). This 
MPL optimization was fully implemented in SNaQ (Species Networks applying Quartets) 
and is part of our open source package PhyloNetworks in Julia [21]. The numerical op¬ 
timization of branch lengths and 7 parameters for a fixed topology is performed with a 
derivative-free methodology in the NLopt package for Julia. The heuristic optimization of 
the network topology uses a strategy similar to that in [9] . Given a hxed maximum number 
of hybridizations (hm), we search for the MPL network with at most hm hybridizations. 
Since the pseudolikelihood can only improve when hybridizations are added, we expect the 
hnal network to have h = exactly. A network is estimated for various values of hm, fol¬ 
lowed by a model selection procedure to select the appropriate number of hybridizations (see 
below). For a given hm, the search is initialized with a tree from a very fast quartet-based 
tree estimation method like ASTRAL [22] or Quartet MaxCut [2ll [23]. The length of each 
branch is initialized using the average observed CF of the quartets that span that branch 
exactly, CF, transformed to coalescent units by t = — log(l — 3/2 CF). The search then 
navigates the network space by altering the current network using one of 5 proposals, chosen 
at random: 1 ) move the origin of an existing hybrid edge, 2 ) move the target of an existing 
hybrid edge, 3) change the direction of an existing hybrid edge, 4) perform a nearest-neighbor 
interchange move (NNI) on a tree edge, and 5) add a hybridization if the current topology 
has h < hm- Any new proposed network is checked to verify that it is a semi-directed level -1 
network with h < hm and with at least one valid placement for the root. More details on 
these moves are provided in SI Text. Although the deletion of a current hybridrization is not 
proposed (because the MPL network should have h = hm), this deletion is still performed 
when suggested by the data, if the numerical optimization of parameters returns a 7 = 0 . 
In this case, the corresponding hybrid edge is removed and the search attempds to add it 
back at random in the neighborhood of the original hybrid edge. If this attempt fails for all 
neighbors, the hybridization is deleted entirely and the search continues from a network with 
1 fewer hybridization. Similarly, if the numerical optimization returns a branch of length 
f = 0, an NNI move is proposed immediately on that branch. The search continues until 
the pseudolikelihood converges or until the number of consecutive failed proposals reaches a 
limit. 

In [23], Huber et ah proved that the space of unrooted level-1 networks is connected 
by local subnetwork transfers, which generalize the NNI operations on trees and which are 
similar to our moves 1, 2 and 4. Although we do not have a formal proof that the MPL 
network can be reached from the starting tree using our proposals, the results in [25] suggest 
that it is the case. 
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Statistical uncertainty. There are two sources of uncertainty when we estimate CFs 
from sequence data. Gene trees are not observed directly but estimated, and only a finite 
number of genes can be sampled. Our preferred estimation of quartet CFs integrates over 
both sources of noise using BUCKy [271126], to estimate true gene tree conflict and discard 
conflict due to gene tree error. BUCKy returns estimated genome-wide CFs and their 95% 
credibility intervals. These CFs were shown to be most influenced by highly informative 
genes and least influenced by genes with large tree uncertainty [T5]. Very briefly, MrBayes 
[25] is run on each gene separately and the full tree samples from all genes serve as input 
for BUCKy, which is run separately on each 4-taxon set. BUCKy has a prior probability of 
(1 -|- a/3)/(l -|- a) that 2 genes share the same quartet tree. For example, choosing a = 1 
amounts to assuming a prior concordance probability of 0.667, compared to 0.333 if gene 
trees matched just by chance. 

A faster way to estimate CFs from sequences would be to use maximum likelihood with 
RAxML [29] (or maximum parsimony for faster estimation) on each gene separately, and 
then to simply count the number of genes that display each quartet tree. To account for tree 
uncertainty, one may drop any gene, for a given 4-taxon set, that does not have bootstrap 
support above some threshold (like 70%) for one of the 3 quartets. This method would not 
account for the uncertainty due to having a limited number of genes. With only 10 genes, 
for instance, the estimated CFs would necessarily be of low precision, of the form i/10 in 
the best case when all 10 gene trees are strongly supported. 

If some genes are missing some taxa, the quartet CFs obtained on a given 4-taxon set 
can simply use the subset of genes that have sequences for the 4 taxa of interest. In most 
cases, a large number of genes can be included for each given 4-taxon sets, even if none of 
the genes have data across the full taxon set. Furthermore, the collection of 4-taxon sets 
with available CF data does not need to be exhaustive, as the sum in ([T]) simply involves 
the sampled 4-taxon sets. 

To measure uncertainty in the network, one may re-do the network analysis on bootstrap 
data sets. If we estimated credibility intervals for CFs with BUCKy, then 100 credible 
sets of quartet CFs can be obtained by sampling each CF from its posterior distribution, 
approximated by its credibility interval. If CFs were obtained using RAxML and observed 
quartet frequencies in gene trees, then bootstrap sets of quartet CFs could be obtained by 
sampling one bootstrap tree from each gene. To summarize the networks estimated from 
these bootstrap sets, we first calculated the support for edges being in the major tree: the 
tree obtained by suppressing the minor hybrid edge (with 7 < 0.5) at each reticulation. We 
then summarized the support for the placement of each minor hybrid edge on that tree, 
considering 2 edges as equivalent if they are of the same type (hybrid or tree edges) and 
define the same clusters [9]. 

Uncertainty in the number of hybridizations h is more difficult to capture (see Discussion). 
We used here a slope heuristic to find where the network score changes from a sharp to a slow 
linear decrease as h increases. We also looked to see if the bootstrap support for successive 
reticulations dropped at the same h value. 
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Results 


Simulated data 

We carried out simulations to compare the speed and accuracy of SNaQ and PhyloNet. 
Given that PhyloNet uses the rooted and full gene trees, SNaQ can only be expected to 
perform as accurately as PhyloNet at best. Our simulations show that a pseudolikelihood 
approach does not compromise too much accuracy, but greatly improves speed. 

We simulated g gene trees with ms ED] under four different networks: {n,h) = ( 6 , 1 ), 
( 6 , 2 ), (10,1) and (15,3), with 7 values set to 0.2 or 0.3 on each minor hybrid edge (see SI 
Text) These network topologies were chosen at random by simulating a tree with n taxa 
under the coalescent, then choosing two edges at random for the origin and target of each 
hybridization and rejecting networks of level > 1. On 6 taxa all reticulations were hard to 
reconstruct with fc = 4, including a bad diamond I in the case h = 2. On 10 and 15 taxa, 
both networks also had a diamond, of the bad type II for n = 10. We varied the number of 
genes between 10 and 3000. All analyses were run on 2.7-3.5 GHz processors. 

We hrst used the true simulated gene trees for inference. The rooted gene trees served 
as input for PhyloNet and the unrooted quartet GFs as observed in the g gene trees served 
as input for SNaQ. The semi-directed network returned by SNaQ was rooted by the out¬ 
group species, when compatible with the estimated hybrid edges. Next, we used Seq-Gen 
[3T] to simulate sequences of length 500 under HKY, k, = 2, A,G,G and T frequencies of 
0.300414,0.191363,0.196748,0.311475 and population mutation rate 9 = 0.036, as in [9]. 
Gene trees were estimated with MrBayes [28] using 10® generations sampled every 200, 25% 
burnin and an HKY model. The consensus trees (one per gene) served as input for Phy¬ 
loNet. The posterior tree samples were then used in BUGKy [23 ES] for each 4-taxon set, to 
estimate quartet GFs and use them as input for SNaQ. For this pipeline, we used the tools 
implemened by [32] and available at https://github.com/nstenz/TICR. This procedure 
was replicated 30 times. The accuracy of each method was measured as the proportion of 
times that the estimated network matched the true network. To compare rooted networks we 
used the distance in [53], which is a metric on reduced networks (including level -1 networks) 
and is implemented in PhyloNet. We used it to detect equality between rooted networks, 
but not to measure how “close” networks were, because this distance is very sensitive to 
small differences such as a change in the direction of a hybrid edge. 

Fig. [ 6 ] summarizes the accuracy and speed of SNaQ and PhyloNet. On 10 or 15 taxa 
PhyloNet was too slow to run (a single replicate with 10 taxa and 300 loci required over 400 
hours), so we cannot provide a comparison of accuracy on these 2 larger networks. 

For networks with h = 2 or more, the accuracy of SNaQ decreased. So, for each semi- 
directed network estimated by SNaQ, we determined if its unrooted topology matched that 
of the true network. Fig. [7| shows that in the vast majority of cases when the directed 
network was incorrectly estimated, its unrooted topology was still correctly inferred from 
true gene trees and for n = 6 with estimated gene trees. For n > 10, the inferred direction of 
hybrid edges degraded when gene trees were estimated. In most replicates on 10 taxa, this 
was because the bad diamond H near the root in the true network had a wrong estimated 
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n=6, h=1 



n=10, h=2 



n=6, h=2 




Figure 6: Performance (average computing time per replicate) of SNaQ and Phy- 
loNet: in simulations using true gene trees on networks with n = 6, 10 or 15 taxa and 
h = 1, 2 or 3. Each replicate consisted of 10 independent runs with full optimization 
of branch lengths and inheritance probabilities for each run. Pie charts display accuracy 
(black: probability of recovering the true network). With n = 10 and 300 or more loci, or 
with n = 15, PhyloNet was too slow to run. 


13 






placement of the hybrid node. 

To detemine which features in the network were correctly estimated, we extracted the 
major tree from each network, that is, the tree obtained by keeping the major hybrid edge 
and suppressing the minor hybrid edge at each hybrid node. We then compared the true 
major tree (from the true network) to the estimated major tree using the Robinson-Foulds 
distance (see Fig. | 8 ]). The major tree was correctly estimated from 300 or more genes in all 
scenarios, except when n = Q,h = 2 and 300 genes (1 replicate out of 30) and 1000 genes 
(1 replicate out of 30). In both cases, the true major tree was displayed in the estimated 
network but the major hybrid edge was estimated as a minor edge with 7 < 0.5. Therefore, 
the network’s “backbone”, i.e. the major vertical inheritance pattern, can still be estimated 
accurately even when the full network and hybrid edges are not (Fig. [^. 

Among cases when the major tree was correctly estimated, we determined the detection 
accuracy of each true hybridization event. To do so, we compared each estimated hybridiza¬ 
tion with the true hybridization of interest. In each network (true and estimated), we re¬ 
moved the other hybridizations by suppressing their minor hybrid edges and used the known 
outgroup to root both networks. We then calculated the hardwired cluster distance between 
the two resulting networks to determine if the estimated hybridization event matched the 
true hybridization of interest: connecting the same donor edge to the same recipient edge 
in the major tree (Fig. |9]). For n = 6 , the hybridizations forming a good diamond were 
recovered with high accuracy from 100 genes, but the hybridization forming a bad diamond 
I (case h = 2) was very hard to recover, needing more than 1000 genes for an accurate infer¬ 
ence of the hybrid edges’ direction. Still, the unrooted cycle was correctly estimated from 
100 genes or more. For n = 10 and u = 15 taxa, the hybridization creating a cycle of k = 4 
nodes was also very hard to detect with its correct direction, although its undirected cycle 
was accurately recovered from a few hundred genes. Hybridizations were recovered more ac¬ 
curately as their cycles spanned more nodes, with a high recovery rate for the hybridizations 
with k = 6 and k = 7 from 100 genes or more. 

Xiphophorus fishes evolution 

We re-analyzed transcriptome data from [n] to reconstruct the evolutionary history of 24 
swordtails and platyhshes {Xiphophorus: Poeciliidae). Based on high CFs of splits in conflict 
with their species tree followed by a series of ABBA-BABA tests [35], [n] concluded that 
hybridization or gene flow was widespread in the history of these tropical hshes. We re¬ 
analyzed their first set of 1183 transcripts. BUCKy was performed on each of the 10,626 
4-taxon sets. The resulting quartet CFs were used in SNaQ, using hm = 0 to 5 and 10 runs 
each. The network with h = 0 and the major tree in the network with h = 1 were identical 
to the total evidence tree in with X. xiphidium placed within the grade of southern 
platyhshes (SP), making the northern platyhshes (NP) paraphyletic (see SI Text). With 
h > 2 the major tree was almost identical but with NP monophyletic (Fig. fTOl) because 
X. xiphidium was found sister to the rest of the NP species, but involved in a reticulation 
(see below). With h > 3, a reticulation within the southern swordtails (SS) was found 
consistently (7 = 0.43), but with a direction in conhict with SS being an outgroup clade. Its 
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Figure 7: Accuracy of SNaQ in simulations using true gene trees or sequence align¬ 
ments. Even when the semi-directed topology was not recovered, the unrooted topology 
was estimated correctly for most replicates using 30 loci or more and h <2. 
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Figure 8 : Accuracy of SNaQ to recover the major tree in the species network, 
from seqnence alignments. The major tree is obtained by suppressing all minor hybrid 
edges (7 < 0.5) to capture the major vertical inheritance pattern. Accuracy is measured as 
half the Robinson-Foulds distance between the true and estimated tree, i.e. the number of 
incorrect edges in the estimated tree. A lot fewer genes are needed to accurately estimate 
the major vertical pattern, compared to the horizontal pattern. 

cycle had only /c = 5 nodes, 4 of them leading to a single taxon ( see SI Text) so we suspect 
an error in the inferred hybrid node and gene flow direction. The extra 2 reticulations found 
with h = 4 and 5 had low 7 values (in [0.006 — 0.16]). 

The network scores (negative log-pseudolikelihood) decreased sharply from h = 0 to 
h = 2 then slightly and somewhat linearly (see SI Text), suggesting that h = 2 best fits 
the fish data using a slope heuristic |l 6 l 05]. The network estimated with h = 2 (Fig. fTOj) 
fonnd X. xiphidium involved in an ancient reticulation, contributing a proportion 7 = 0.17 
of genes to the lineage ancestral to northern swordtails (NS). This reticulation might explain 
the placement of X. xiphidium closer to the root in m, from tree-based methods that do 
not acconnt for potential gene flow. The second hybridization (7 = 0.20) was fonnd from the 
popnlation ancestral to X. multilineatus and X. nigrensis into X. nezahuacoyotl, and relates 
to a high CF found by m for a clade uniting X. nezahuacoyotl and the nigrensis gronp. 

Bootstrap data sets were simulated by sampling each quartet CF from a uniform distribn- 
tion on its 95% credibility interval (conservatively) then normalizing the sampled CFs across 
the 3 quartets on each 4-taxon set. For each bootstrap data set we estimated a network using 
3 runs, and h = 3 (instead of 2 ) because the third inferred reticulation had a high 7 (see SI 
Text) and to assess the ability of the bootstrap procedure to identify the best h valne. If the 
bootstrap was consistent with the slope henristic, we expected high bootstrap snpport for 
the placement of the first 2 reticulations and lower snpport for the third. As expected, this 
third reticulation and network topology within the SS clade was variable among bootstrap 
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Figure 9: Accuracy of SNaQ to recover each hybridization: proportion of times each 
reticulation was correctly inferred (connecting the correct donor edge to the correct recipient 
edge in the major tree), among sequence alignments in which the major tree was recovered. 
Minor hybrid edges are numbered and colored as in SI Text. For reticulations creating a 
cycle of A: = 4 nodes, we also calculated the proportion of times that this undirected (or 
“unrooted”) cycle was correctly inferred, even though the identity and direction of hybrid 
edges in this cycle might be incorrect (empty symbols). The proportion of times that all 
hybridizations were correctly inferred (black lines) was low when a single hybridization with 
k = 4 was hard to recover (bad diamond I in case n = 6, h = 2, and bad diamond II in case 
n = 10). 
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Figure 10: Xiphophorus reticulate evolution estimated with SNaQ: from 1183 genes, 
h = 2, rooted with the southern swordtails outgroup clade (SS). NS: northern swordtails, 
SP: southern platyfishes, NP: northern platyhshes. Black edges: major tree (including hy¬ 
brid edges with 7 > 0.5). Colored solid arrows: minor hybrid edges, annotated by their 
estimated 7. Black numbers: bootstrap support for edges in the major tree, if different from 
100%. Colored numbers: bootstrap support for the placement of minor hybrid edges. One 
reticulation had 75% support for a different donor lineage (dotted arrow) than inferred from 
the original data. 

networks (see SI Text), suggesting uncertainty in the major tree within this clade (Fig. fTOj) . 
The rest of the tree was highly supported, as was the placement of the reticulation involving 
X. xiphidium. The reticulation involving X. nezahuacoyotl had split support for its donor 
lineage, with 75% support for a more ancestral lineage (Fig. fTOj) . 


Discussion 

Many methods are being developped to understand organisms whose evolution behaves more 
net-like rather than tree-like. There is evidence of reticulation at all levels in the tree of life: 
deep among early prokaryotic and eukaryotic groups, to shallow among recently diverged 
species (e.g. [Ml EH [38]) or even among populations of the same species. Our new and 
fast statistical method to infer phylogenetic networks from multi-locus data could be used 
at these various levels in the tree of life. 

Network model and assumptions. Network inference is theoretically and computation¬ 
ally challenging. Split networks can be estimated rapidly, yet lack an evolutionary model 
and biological interpretability. [M] proposed a very fast distance-based approach to re¬ 
construct topological ancestral recombination graphs (tARGs) from a long alignment, but 
the biological interpretability of tARGs is still limited. The evolution model in [ 8 ] uses an 
explicit network and satisfyingly accounts for various processes: reticulation events, deep 
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coalescences, and substitutions. Yet a full likelihood estimation of large network (as in [9]) 
seems beyond computational reach. Our pseudolikelihood method offers an alternative, al¬ 
lowing the estimation of bigger and more complex networks while maintaining biological 
interpretability and a flexible evolutionary model. 

We assumed a level-1 network throughout, where each hybrid node is part of a single 
cycle. This assumption is quite restrictive, but |10] showed that sequence data and gene trees 
on present-day species do not contain enough information to reconstruct complex networks, 
even from many loci. Therefore, some assumption has to be made to limit the network 
complexity. Extending our method to networks with intersecting cycles will need further 
work to restrict the search to candidate networks that are distinguishable from each other. 
Indeed, |10] show that different level-2 networks can have the exact same likelihood, and 
hence pseudolikelihood. So no method based on gene trees can ever decide which of these 
level-2 networks is true. Under a model without ITS, using full gene trees and branch length 
in substitutions per sites comparable across genes, [iQ] showed that level-1 networks are 
distinguishable but level-2 networks are not necessarily. Extending our approach to higher 
level networks, with or without ITS, will require extensive theory to work around this lack 
of identihability. 

Our approach allows for multiple individuals per species. All alleles from the same species 
simply need to be treated as a known and hxed polytomy in the network. Future work could 
include this and other topology constraints on the network, to reduce the computational 
burden when there are known phylogenetic relationships. 

Branch lengths. We allow hybrid edge lengths to be 0, but we do not constrain them to 
be 0 (unlike in [6l[8]) even though each gene flow event has to occur between contemporary 
populations. If one parental population went extinct or has no sampled descendants, the 
hybrid edge from this parent has a positive length in the observable network. A second 
reason is that a long branch can £t a population bottleneck, as might be expected in the 
formation of a new hybrid species. Not constraining hybrid branch lengths to 0 has a 
computational burden, however. Future implementations might enforce this constraint, when 
taxon sampling is thorough and extinction of parental populations can be ruled out. 

By considering quartet topologies only, we ignored branch lengths in gene trees. This 
choice frees us from various assumptions. Using gene tree branch lengths, which are in 
substitutions per site, would require some assumption on gene rates to make branch lengths 
comparable across trees, and a molecular clock on gene trees. Other assumptions would 
also be needed on population sizes, shared or not across lineages. The recent approach in 
[TT| should scale well to many taxa, but makes these strong assumptions because it requires 
accurate distances obtained from branch lengths in gene trees. On the contrary, our approach 
should be robust to rate variation across genes and across lineages, and does not require any 
assumption on population sizes. 

Identifiability of the topology. Yu et al. [8] already noted a lack of identihability from 
rooted gene trees for reticulations with k = 3 from only 4 taxa (including the outgroup). 
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We found a similar lack of identifiability from unrooted quartets if n < 5. In practice, some 
reticulations are hard to detect even with 5 or more taxa, if some branches are long with no 
ILS (close to violating Al). However, in these cases the unrooted topology of the network 
can still be recovered, even if the direction of gene flow and the placement of the hybrid 
node is not. Therefore, heuristic strategies that keep the unrooted network unchanged, or 
that just slightly modify it, may improve the search for the best network. 

More tools are needed to study unrooted and semi-directed phylogenetic networks. For 
instance, no distance measure has been developed for such networks, that we know of. 
Distances between rooted networks would also be needed, that would be less sensitive to 
small changes in the unrooted or semi-directed topologies than the distance proposed in 
[33] . New notions of edge equivalence would also be needed on unrooted and semi-directed 
networks. It would help summarize a bootstrap sample of networks for instance, with no 
need for an outgroup. 

We propose here a tree-based but informative summary by extracting the major tree 
from each network, obtained by dropping any minor hybrid edge (with inheritance 7 < 0.5). 
Because this tree summarizes the major vertical inheritance pattern at each node, it can be 
considered an estimate of the species tree. We found that recovering the underlying species 
tree can be much easier (requiring fewer genes) than recovering the horizontal signal. Even 
if the species tree is the main purpose of a study, [3l| showed that species-tree methods can 
be inconsistent in recovering the vertical signal if there is gene flow, so using a network can 
be benehcial to avoid the possible inconsistency of tree-based coalescent methods. 

Missing data. All data analyzed here had full taxon sampling from each gene, and we 
were able to use all 4-taxon sets. Future work could assess the impact of missing data 
(gene sequences, or 4-taxon sets) on the method’s accuracy. Missing 4-taxon sets will be 
necessary for large networks, because the number of 4-taxon sets grows very rapidly with 
the number of taxa (~ n'^/24). With many taxa, one may randomly select a collection 
of 4-taxon sets and/or choose them specihcally. SNaQ calculates the number of quartets 
involving each taxon and provides information about under-represented taxa, if any. With 
many individuals per species, one may greatly reduce the collection of 4-taxon sets to be 
analyzed by randomly sampling from those containing at most one individual per species. If 
the assignment of individuals to species is correct, any 4-taxon set containing 2 individuals 
from the same species would be non-informative about the species-level relationships. This 
strategy is used in [42] to infer species trees under ILS. 

Uncertainty in the number of hybridizations Model selection is necessary to estimate 
the number of hybridizations h, because the pseudolikelihood is bound to improve as h 
increases, like the likelihood or parsimony score in [13]. We used here the log pseudolikelihood 
prohle with h. A sharp improvement is expected until h reaches the best value and a 
slower, linear improvement thereafter. Such data-driven slope heuristics can indeed be used 
with contrast functions (like pseudolikelihoods) for model selection in regression frameworks 

gniiis]. 
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Information criteria have already been used to select h (e.g. [11] ), but these criteria 
are inappropriate if the full likelihood is replaced by a pseudolikelihood. Theory is missing 
to compare the pseudolikelihoods of different networks, because of the possible correlation 
between quartets from different 4-taxon sets. It can be shown, however, that quartets from 
two 4-taxon sets si and S 2 are independent if si and S 2 overlap by at most one taxon and if 
the true 4-taxon subnetworks share no internal edges. Future work could exploit this partial 
independence to construct hypothesis tests. 

Cross validation has been proposed by [S], and was shown to have good performance. 
In our framework, the cross-valication error could be measured from the difference between 
the quartet CFs observed in the validation subset and the quartet CFs expected from the 
network estimated on the training set. Because iC-fold cross-validation requires partitioning 
the loci into K subsets and re-estimating a network K times at each h value, this approach 
can be computationally heavy. 

Finally, [32] proposed a goodness-of-fit test, also based on quartet CFs, to determine if 
a tree with ILS hts the observed data or if a network is needed instead. This test could be 
extended to networks, to decide if a given h provides an adequate fit. One advantage to 
this approach is that testing the adequacy of a given h does not require to estimate a larger 
network with h + 1 hybridizations, whereas other approaches above would require estimation 
of both networks in order to decide that the simpler network is sufficient. 

Pseudolikelihood with rooted triples. After submission, we learned about similar work 
using subnetworks and a pseudolikelihood approach m, which scales to many taxa. In m, 
the pseudolikelihood is based on rooted triples whereas we use unrooted quartets. There are 
fewer triples, so the method in [47] is potentially faster. However, fewer triples means less 
information. For example, the networks Ti and T 2 shown in Fig. 2 of BH, which are not 
distinguishable from triplets, are in fact distinguishable from quartets (see SI Text). Our 
thorough study of the network identifiability allowed us to implement a search that avoids 
jumping between networks that are not distinguishable, which facilitates convergence. The 
downside of our approach is the assumption of a level-1 network. Instead, m do not assume 
any restriction on the network. Finally, our method does not require rooted gene trees as 
input, which we view as a major advantage because rooting errors are avoided. 


Acknowledgments 

We are very grateful to Douglas Bates for encouraging us to use Julia. We thank Noah 
Stenz for helping with the CF estimation pipeline, Ray Cui for providing MrBayes output 
of their original analysis, Sarah Friedrich for her help with artwork, David Baum for many 
meaningful discussions and two anonymous reviewers whose constructive feedback greatly 
improved the manuscript. 


21 


References 


[1] Huson D, Rupp R, Scornavacca C. 2010. Phylogenetic Networks. New York, NY: 
Cambridge University Press, 1st edition. 

[2] Spillner A, Bastkowski S, Moulton V, Griinewald S, Bogershausen A. 2013. SuperQ: 
computing supernetworks from quartets. IEEE/ACM transactions on computational 
biology and bioinformatics / IEEE, ACM. 10:151-60. 

[3] Than C, Ruths D, Nakhleh L. 2008. PhyloNet: a software package for analyzing and 
reconstructing reticulate evolutionary relationships. BMC bioinformatics. 9:322. 

[4] Yang J, Griinewald S, Xu Y, Wan XF. 2014. Quartet-based methods to reconstruct 
phylogenetic networks. BMC systems biology. 8:21. 

[5] Gambette P, Berry V, Paul G. 2012. Quartets and unrooted phylogenetic networks. J 
Bioinform Comput Biol, 10 (4): 1250004. 

[6] Meng G, Kubatko LS. 2009. Detecting hybrid speciation in the presence of incomplete 
lineage sorting using gene tree incongruence: a model. Theoretical population biology. 
75:35-45. 

[7] Strimmer K, Moulton V. 2000. Likelihood analysis of phylogenetic networks using 
directed graphical models. Molecular biology and evolution. 17:875-81. 

[8] Yu Y, Degnan JH, Nakhleh L. 2012. The probability of a gene tree topology within 
a phylogenetic network with applications to hybridization detection. PLoS genetics. 
8:el002660. 

[9] Yu Y, Dong J, Liu KJ, Nakhleh L. 2014. Maximum Likelihood Inference of Reticulate 
Evolutionary Histories. PNAS. 111:16448-16453. 

[10] Nguyen Q, Roos T. 2015. Likelihood-based inference of phylogenetic networks from 
sequence data by PhyloDAG. in Proc. 2nd International Conference on Algorithms for 
Computational Biology, LNBl 9199, Springer, ppl26-140. 

[11] Gui R, Schumer M, Kruesi K, Walter R, Andolfatto P, Rosenthal GG. 2013. Phy- 
logenomics reveals extensive reticulate evolution in Xiphophorus hshes. Evolution; 
international journal of organic evolution. 67:2166-79. 

[12] Francis, AR, Steel, M. 2015. Which Phylogenetic Networks are Merely Trees with 
Additional Arcs? Systematic Biology. 64(5):768-777. 

[13] Liu L, Yu L, Edwards SV. 2010. A maximum pseudo-likelihood approach for estimating 
species trees under the coalescent model. BMC evolutionary biology. 10:302. 


22 


[14] Baum DA. 2007. Concordance trees, concordance factors, and the exploration of retic¬ 
ulate genealogy. Taxon. 56:417-426. 

[15] Ane C. 2010. Reconstructing concordance trees and testing the coalescent model from 
genome-wide data sets. Chapter 3, p.35-52 in ’’Estimating species trees: Practical and 
theoretical aspects”, L. Knowles and L. Kubatko, eds. Wiley-Blackwell. 

[16] Allman ES, Degnan, JH, Rhodes JA. 2011. Identifying the rooted species tree from the 
distribution of unrooted gene trees under the coalescent, J. Math. Biol.. 62(6):833-862. 

[17] Grayson DR, Stillman ME. Macaulay2, a software system for research in algebraic 
geometry, available at http://www.math.uiuc.edu/Macaulay2/. 

[18] Allman ES, Ane C, Rhodes JA. 2008. Identihability of a Markovian model of molecular 
evolution with Gamma-distributed rates. Adv. in Appl. Probab. 40(l):229-249. 

[19] Allman ES, Rhodes JA. 2009. The Identihability of Covarion Models in Phylogenetics, 
lEEE/ACM Trans Comput Biol Bioinform, 6(l):76-88. 

[20] Pickrell JJK, Pritchard JJK. 2012. Inference of population splits and mixtures from 
genome-wide allele frequency data. PLoS genetics. 8:el002967. 

[21] Bezanson J, Edelman A, Karpinski S, Shah VB. 2014. Julia: A fresh approach to 
numerical computing http://arxiv.org/abs/1411.1607 

[22] Mirarab S, Reaz R, Bayzid MS, Zimmermann T, Swenson MS, Warnow T. 2014. AS¬ 
TRAL: genome-scale coalescent-based species tree estimation. Bioinformatics. 30:i541- 
i548. 

[23] Avni E, Cohen R and Snir S. 2015. Weighted quartets phylogenetics. Systematic 
Biology, 64(2):233-242. 

[24] Snir S, Rao S. 2012. Quartet MaxCut: A fast algorithm for amalgamating quartet 
trees. Molecular Phylogenetics and Evolution, 62(l):l-8. 

[25] Huber KT, Linz S, Moulton V, Wu T. 2015. Spaces of phylogenetic networks from 
generalized nearest-neighbor interchange operations. Journal of Mathematical Biology. 
In press. 

[26] Larget B, Kotha SK, Dewey CN, Ane C. 2010. BUCKy: Gene tree / species tree 
reconciliation with Bayesian concordance analysis. Bioinformatics, 26(22):2910-2911. 

[27] Ane C, Larget B, Baum DA, Smith SD, Rokas A. 2007. Bayesian estimation of con¬ 
cordance among gene trees. Molecular biology and evolution. 24:412-26. 

[28] Ronquist F, Huelsenbeck JP. 2003. MrBayes 3: Bayesian phylogenetic inference under 
mixed models. Bioinformatics 19:15721574. 


23 


[29] Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post¬ 
analysis of large phylogenies. Bioinformatics 30:13121313. 

[30] Hudson RR. 2002. Generating samples under a Wright-Fisher neutral model of genetic 
variation. Bioinformatics (Oxford, England). 18:337-338. 

[31] Rambaut A, Crassly NC. 1997. Seq-Gen: An application for the Monte Carlo sim¬ 
ulation of DNA sequence evolution along phylogenetic trees. Comput. Appl. Biosci. 
13:235-238. 

[32] Stenz N, Large! B, Baum DA, Ane C. 2015. Exploring tree-like and non-tree-like 
patterns using genome sequences: and example using the inbreeding plant species 
Arabidopsis thaliana (L.) Heynh. Systematic Biology. 64(5):809-823. 

[33] Nakhleh L. 2010. A metric on the space of reduced phylogenetic networks. lEEE/ACM 
Transactions on Computational Biology and Bioinformatics. 7:218-222. 

[34] Solis-Lemus, C., M. Yang, and C. Ane. 2015. Inconsistency of species-tree methods 
under gene flow. Systematic Biology, accepted pending revisions. 

[35] Durand EY, Patterson N, Reich D, Slatkin M. 2011. Testing for ancient admixture 
between closely related populations. Molecular Biology and Evolution, 28:2239-2252. 

[36] Clark AG, Messer PW. 2015. Conundrum of iumbled mosquito genomes. Science 
347(6217):27-28. 

[37] Fontaine MC et al. 2015. Extensive introgression in a malaria vector species complex 
revealed by phylogenomics. Science 347(6217): 1258524 

[38] Jonsson MS et al. 2014. Speciation with gene flow in equids despite extensive chromo¬ 
somal plasticity, PNAS 111(52): 18655-18660 

[39] Camara PC, Levine AJ, Rabadan R. 2015. Inference of ancestral recombination graphs 
through topological data analysis, http: //arxiv. org/abs/1505.05815vl 

[40] Pardi F, Scornavacca C. 2015. Reconstructible Phylogenetic Networks: Do Not Dis¬ 
tinguish the Indistinguishable. PLOS Computational Biology. Il:el004135. 

[41] Yu Y, Nakhleh L. 2015. A distance-based method for inferring phylogenetic networks 
in the presence of incomplete lineage sorting. Bioinformatics Research and Appli¬ 
cations, Lecture Notes in Computer Science 9096, Springer International Publishing 
Switzerland, pp.378-389 

[42] Chifman J, Kubatko L. 2014. Quartet inference from SNP data under the coalescent 
model. Bioinformatics, 30(23):3317-3324. 

[43] Yu Y, Barnett RM, Nakhleh L. 2013. Parsimonious inference of hybridization in the 
presence of incomplete lineage sorting. Systematic biology. 62:738-51. 


24 


[44] Kubatko, LS. 2009. Identifying hybridization events in the presence of coalescence via 
model selection, Systematic Biology 58(5):478-488 

[45] Baudry J, Maugis C, Michel B. 2012. Slope heuristics: overview and implementation. 
Statistics and Computing, 22(2):455-470 

[46] Birge L, Massart P. 2006. Minimal penalties for Gaussian model selection. Probab. 
Theory Relat. Fields, 138:33-73. 

[47] Yu Y., Nakhleh, L. A maximum pseudo-likelihood approach for phylogenetic networks. 
BMC Genomics 2015, 16(Suppl 10): SIO. 

[48] Bezanson J, Karpinski S, Shah VB, Edelman A. 2012. Julia: A Fast Dynamic Language 
for Technical Computing, http://arxiv.org/abs/1209.5145. 

[49] Cox D, Little J, O’Shea D. 2007. Ideals, varieties, and algorithms. Springer, third 
edition. 

[50] Feller W. 1950. Introduction to Probability Theory vol. 1. New York, NY: Wiley, third 
edition. 

[51] Huelsenbeck JP, Ronquist F. 2001. MrBayes: Bayesian inference of phylogeny. Bioin¬ 
formatics. 17:754-755. 

[52] Massart P. 2007. Concentration inequalities and model selection, Ecole d’Ete de 
Probabilites de Saint-Flour 1896. Springer-Verlag Berlin Heidelberg. 


25 


SI Text: Supplementary Material 

A Quartet CFs under the coalescent with hybridiza¬ 
tion 


A.l Quartet CFs for a 4-taxon network with one hybridization 

A argument similar to what is described in the main text was applied to all 4-taxon networks 
with h = 1. The results are summarized below. 
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CFab\cd = (1 — 7)^(1 ~ 2/3exp(—ti — 12 )) 

+ 27(1 - 7)(1 - exp(-ti) -h l/3exp(-ti - t^) 

-t 7^(1 - 2/3 exp(-ti - U)) 

CFac\bd = (1 — 7)^(l/3exp(—— 12 )) 

-t7(l -7)exp(-ti)(l - l/3exp(-t3)) 

-t 7^(l/3exp(-ti - U)) 

CFad\bc = (1 — 7)^(l/3exp (—<1 — 12 )) 

+ 7(1 - 7 )exp(-ti)(l - l/3exp(-t3)) 

-t 7^(l/3exp(-ti - U)) 

CFab\cd = (1 -7)(1 - 2/3exp(-ti)) -^ 7(1 - 2/3exp(-ti - ^ 2 )) 
CFac\bd = (1 - 7 ) 1/3 exp(-ti) 7 I /3 exp(-ti - ^ 2 ) 

CFad\bc = (1 - 7 ) 1/3 exp(-ti) -h 7 I /3 exp(-ti - ta) 


CFab\cd = (1 -7)(1 - 2/3exp(-ti)) -h 7(1/3exp(-t 2 )) 
CFac\bd = (1 - 7 ) 1/3 exp(-ti) -h 7(1 - 2/3 exp(-< 2 )) 
CFad\bc = (1 - 7 ) 1/3 exp(-ti) -h7l/3exp(-t2) 

CFab\cd = (1 - 7)^(1 “ 2/3exp (—— t 2 — ts)) 

+ 27(1 - 7)(1 - 2/3exp(-ti - ^ 2 )) 

-I- 7^(1 - 2/3exp(-ti — t 2 - tA) 

CFac\bd = (1 - 7 ) 71/3 exp(-ti - t 2 -tA) 

+ 27(1 - 7 )(l/ 3 exp(-ti - t 2 )) 
-l- 7 ^(l/ 3 exp(-ti -t 2 - U)) 

CFad\bc = (1 - 7)^(1/3 exp(-ti - ^2 - ta)) 

-t 27(1 - 7 )(l/ 3 exp(-ti - t 2 )) 

+ 7^(l/3exp(-ti -t 2 - < 4 )) 
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Network 5 


CFab\cd = 1 — 2/3exp(—ti) 
CFac\bd = l/3exp(—ti) 
CFad\bc = l/3exp(—<i) 


A.2 Subnetwork equivalence for quartet CFs 


We show here how a level-1 four-taxon network with any number of hybridizations can be 
reduced to an equivalent network with h = 0 or 1 hybridizations only, where equivalence 
means that both networks give the same quartet CFs. The network is reduced by induction 
on h, by replacing each “blob” in the network (a set of nodes and edges along a given cycle 
[25] ) by a simpler network. To simplify notations we use Zi = 1 — exp(—f*). A subnetwork is 
of type 1 if it leads to two equal minor CFs, and is therefore equivalent to a subnetwork with 
no hybridization. A subnetwork is of type 2 if it leads to three different quartet CFs, and 
is not equivalent to a tree. We summarize below all the possible subnetwork conhgurations 
on 4 taxa (on the left in each hgure) and the equivalent form (on the right), with the 
formula giving the branch length in the equivalent network to obtain the same CFs from 
both networks. 


exp(-ti) := 1 -I- 7Z3 - 7^24 - 7^23 - (1 - 7)^22 





tl 



exp(-fi) := 1 - yz2 


Subnetwork 2 (type 1) 



Subnetwork 3 (type 2) 
7 



Subnetwork 4 (type 1) 


3CF4B|cn = 1 “ 2(1 — 7 )zi — 72:2 
tt>CFAc\BD = 1 -t 272:2 — (1 — l)zi 
?> CFad\bc = 1 - (1 - 7)2:1 - 72:2 


exp(-ti) := exp(-t 2 )[l - I '^ z ^ - (1 - 


A.3 Four-taxon networks with more than one hybridization 

We can use the subnetwork equivalence just described to account for multiple hybridizations 
on the same network (see example on Fig. [SDleft). Even when none of the hybridizations 
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are of type 1 on the full network, some hybridizations may become of type 1 after pruning 
taxa. Fig. [SI] (right) shows such an example: when reduced to the four taxa 1, 2, 4 and 10, 
the network in Fig. [SS] (d) has two hybridizations of type 1. Both can be removed with the 
subnetwork equivalence described in the previous section. 



Figure SI: Left: Network on 4 taxa with multiple hybridizations of type 1. This network 
is equivalent to a 4-taxon tree (without hybridizations) in terms of expected CFs. Right: 
4-taxon network extracted from the network in Fig. [SS] (d) with 2 hybridizations of type 1. 
This network is equivalent to a 4-taxon tree (without hybridizations) in terms of expected 
CFs. 


B Topology identifiability 

B.l 5-taxon network: topology identifiability 

Assume hrst h = 1. A network with n = 5 taxa has = 5 four-taxon subsets, and each one 
has three possible quartets. Each quartet CF expected on this network is given in previous 
section lA.ll Thus, a 5-taxon network dehnes 3 * 5 = 15 quartet CF equations. The same is 
true for a 5-taxon species tree obtained by removing the hybridization event in the network, 
but with different CF formulas. 

The hybridization in the network is identihable if the same set of quartet CF values cannot 
solve simultaneously the system of equations for the network and the system of equations 
for the tree. If there exists a set of CF values that can solve both systems of equations, then 
we cannot identify the network from the tree based on quartet CFs. Thus, the conditions 
under which CFnetwork = CFtree have a solution allow us to identify the conditions under 
which the network is not identihable from the tree. Take, for example, the 5-taxon network 
in Fig. [S2](left) with system of CF equations in table (ST] To prove that it is identihable from 
the tree in Fig. [S2] (right), we computed the 15 quartet CFs ci,C 2 , ...,ci 5 for the tree with 
the formulas in table [S2] for given values of branch lengths. Then, with aid of Macaulay2, 
we investigated if the set of equations for the network would have a solution when equal to 
those CF values. It turns out that regardless of the values of the tree branch lengths ui,U 2 , 
the 5-taxon network is identihable from the tree as long as to > 0, ti > 0, tn < oo and 
7 6 ( 0 , 1 ). 






Table SI: System of concordance factors equations for 5-taxon network in Fig. [S2] (left) 
CFab\cd = (l- 7 )(l- 2 / 3 exp(-ti))-F 7 ( 1 / 3 exp(-to)) 

ABCD CFad\bc = (1 - 7 )(l/ 3 exp(-ti)) 7(1 - 2/3exp(-to)) 

_ CFac\bd = (1 - 7 )(l/ 3 exp(-ti)) + 7 ( 1 / 3 ) exp(-to)_ 

CFab\ce = (1 -7)(1 - 2/3exp(-ti)) 7 ( 1 / 3 exp(-to)) 

ABDE CFae\bc = (1 - 7 )(l/ 3 exp(-ti)) + 7(1 - 2/3exp(-to)) 

_ CFac\be = (1 - 7 )(l/ 3 exp(-ti)) -F 7 ( 1 / 3 ) exp(-to)_ 

CFac\de = (1 - 7)(1 - 2/3exp(-ti - tu)) + 7(1 - 2/3exp(-tii)) 
ABDE CFad\ce = (1 - 7 )(l/ 3 exp(-ti - tn)) 7 ( 1 / 3 exp(-tii)) 

_ CFae\cd = (1 - 7)(l/3exp(-ti - tn)) 7(1/3exp(-tii))_ 

CFac\de = 1 — 2/3exp(—til) 

ACDE CFad\ce = l/3exp(—til) 

_ CFae\cd = l/3exp(—til)_ 

CFac\de = (1 - 7)(1 - 2/3exp(-tii)) - 1 - 7(1 - 2/3exp(-to - tii)) 
BODE CFad\ce = (1 - 7 )(l/ 3 exp(-tii)) 7 ( 1 / 3 exp(-fo -tii)) 

CFae\cd = (1 - 7)(l/3exp(-tii)) 7 ( 1 / 3 exp(-to -tii)) 


Table S2: System of concordance factors equations for 5-taxon unrooted tree in Fig. [S2] 
(right) 

CFab\cd = 1 — 2 / 3 e““i 

ABCD CFAD\BC = l/^e-^^ 

_ CFac\bd = 1/3 _ 

CFab\ce = 1 — 2 / 3 e““i 

ABDE CFAE\BC = l/^e-^^ 

_ CFac\be = l/3e~^i_ 

CFac\de = 1 — 2 / 3 e ““2 
ACDE CFadice = 1/3 
_ CFae\cd = 1/3 _ 

CFbc\de = 1 — 2/3 

BCDE CFBD\CE = l/^e-^^ 

_ CFbe\cd = 1/3 e~'^^_ 

CFAB\DE = l-2/3e-^^-^^ 

ABDE CFAD\BE = l/Se-^^-^^ 

CFae\bd = l / 3 e ““ i ““^ 
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Figure S2: The network (left) is identifiable from the tree (right) based on quartet CFs, 
provided that to > 0 , ti > 0 , tn < oo and 7 G ( 0 , 1 ), for any value of Ui and U 2 - 

A similar study was done to every possible 5-taxon network with one hybridization to 
conclude that if t* > 0 for all tree edge i and if 7 G (0,1), then the network is generically 
identifiable from the tree (as long as the number of nodes in the hybridization cycle is k > 3). 

B.2 n-taxon network: topology identifiability 

To generalize to n > 5 taxa, not all (”) x3 quartet CF equations are needed. In Fig. [S3l the 
networks are ordered by the number of nodes in the hybridization cycle. Ignoring the case 
with k = 2 which is not identifiable from a tree, and assuming that the network is level- 1 , 
the hybridization cycle in the network is attached to k subnetworks, each represented by 
a small triangle with a given number of taxa n*. With k = 3 nodes in the hybridization 
cycle for example, the network has three subgraphs, each with no,ni,n 2 taxa respectively. 
If 77,0 > 3 in the top-right subnetwork, one can form a 4-taxon subset by taking three taxa 
from this subnetwork and one taxon from any of the other 2 subnetworks. However, the 
three CF formulas for such a 4-taxon subset do not involve any of the parameters near the 
hybridization cycle of interest. That is, the CFs resulting from choosing three taxa from one 
subtree and one taxon from another subtree yield no information about the hybridization 
event of interest. In other words, we can ignore all the equations that involve three taxa 
from the same subgraph. 

Thus, to study the identifiability of an n-taxon network, it suffices to consider only the 
subset of quartets involving rii < 2 taxa for any subnetwork i. If only one tip is taken for 
subnetwork i, then hybridizations in this subnetwork do not affect the quartet CFs. If the 
4-taxon set has n^ = 2 tips from subnetwork i, then there could be a hybridization involving 
them within this subnetwork. Because the network is of level 1 , however, this hybridization 
has to be of type 1 (see Section and can be removed without affecting the CFs, using a 
transformed branch length leading to subnetwork i. Thus, we can study the identifiability of a 
given hybridization of interest ignoring hybridizations in the subnetworks. Using Macaulay2, 
we found the same sufficient conditions as with n = 5 for generic identifiability of the 
topology: t G ( 0 , cxo ) for all tree branch lengths and 7 G ( 0 , 1 ). We give below weaker, but 
still sufficient conditions for each value of k. The n-taxon network with one hybridization 
and k nodes in this hybridization cycle is generically identihable from a tree with the same 
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Figure S3: Characterization of n-taxon networks depending on the number of nodes in the 
hybridization cycle: /c = 2, 3,4, 5 in order from left to right. 


topology but the hybridization removed if: 

k = 3 : < oo; to > 0; 7 6(0,1) 

(1 - 7)(1 - exp(-ti)) ^ 7((1 - exp(-t 2 )) + (1 - exp(-to))) 
7(1 - exp(-t 2 )) 7^ (1 - 7)((1 - exp(-to)) + (1 - exp(-ti))) 

k = 4 : ti2,ti3 <oo; to,ti>0; 7 6(0,1) 

7(1 - exp(-t3)) 7^ (1 - 7)((1 - exp(-ti)) + (1 - exp(-t 2 ))) 

k = 5 : ti 3 , ti 4 < oo ; to , ti , t2 >0; 7 6(0,1) 

7(1 - exp(-t4)) 7^ (1 - 7)((1 - exp(-t2)) + (1 - exp(-t3))) 


C Parameter identifiability 

We study here whether branch lengths and heritabilities are identihable from quartet CFs, 
given a hxed network topology. 

C.l Identifiability of parameters in a 5-taxon network 

As mentioned before, a network with n = 5 taxa has 15 quartets, with quartet CFs given in 
section IXdl Thus, a 5-taxon network dehnes 15 CF equations in the unknown parameters of 
branch lengths and 7 . If this system has a unique solution for all the unknown parameters 
given a set of CF values ci, C 2 ,..., C 15 , then we say that the parameters are identihable. For 
example, recall the 5-taxon network in Fig. [S2] (left) with system of CF equations in table 
[ST] From a quick inspection, it is evident that not all vectors of CF values can yield a 
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solution. For instance, the quartet CFs are identical for the 4-taxon sets ABCD and ABCE. 
Thus, if we consider the first 6 equations: 


CFab\cd = (1 -7)(1 - 2/3exp(-fi)) 7 ( 1 / 3 exp(-fo)) = Ci 

CFad\bc = (l-7)(l/3exp(-fi))-F7(l-2/3exp(-fo)) = C 2 
CFac\bd = (1 - 7)(l/3exp(-fi)) 7 ( 1 / 3 ) exp(-to) = C 3 

CFab\ce = (1 - 7)(1 - 2/3exp(-ti)) -h 7(1/3exp(-to)) = C 4 
CFae\bc = (l- 7 )(l/ 3 exp(-fi))-h 7 (l- 2 / 3 exp(-fo)) = 05 
CFac\be = (1 - 7)(l/3exp(-fi)) 7 ( 1 / 3 ) exp(-to) = Cg, 

it is obvious that a solution can exist only if 

Cl = C 4 , C 2 = C 5 , and C 3 = cq. 

In addition to these three conditions, there are many others that the CF values ci, C 2 ,..., C 15 
need to fulhll for a solution to exist. In other words, the structure of the network imposes 
conditions on the CF values that need to be satisfied for the system of equations to be 
consistent. We will call these conditions invariants, which are like consistency checks for 
the system. On a tree, the invariants are easy to list. For the tree in Fig. 15^ for example, 
equating the expected quartet CFs from table [S2] to a set of CF values Cj gives the following 
system of equations for the 2 unknown branch lengths Ui and U 2 '- 

= 1 — 2/3 e = Ci 
CFad\bc = l/3e = C 2 

CFac\bd = 1/3 e = C 3 

CFab\ce = 1 — 2/3e “^ = 04 
CFae\bc = l/3e = C 5 

CFac\be = l/3e = cg 

CFac\de = 1 — 2/3 e = C 7 
CFad\ce = l/3e = Cg 

CFae\cd = l/3e = cg 

CFbc\de = 1 — 2/3 e = Cio 
CFbd\ce = l/3e = Cii 

CFbe\cd = l/3e = C 12 

CFab\de = 1 — 2/3 e = C 13 
CF^D|BE = l/3e-“i-“^ = ci4 
CFae\bd = l/3e = C 15 . 

Several equations are obviously identical. Hence, these conditions on the c* values are nec¬ 
essary for the system to have a solution {ui,U 2 )'. 

Cl = C 4 , C 2 = C 3 = C 5 = cg, C 7 = Cio, cg = cg = Cii = C 12 and C 14 = C 15 . 
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In addition, the three CF from each 4-taxon set must add up to 1; 

Cl + C2 + C3 = C4 + + Cq = Ct + Cs + Cg = Cig + Cu + C12 = C13 -|- C14 -|- C15 = 1 . 

Finally, three times the minor CF of ABCD multiplied by three times the minor CF of ACDE 
should be equal to three times the minor CF of ABDE. This is because exp(—ui) exp(—M 2 ) 
should be equal to exp (—mi — U 2 ). We thus obtain the following additional invariant 

(3C2)(3C8) = 3Ci4. 

For this species tree, we obtained 13 independent invariants that the q values must satisfy for 
the system to have a solution. If the values Ci, C 2 ,..., C 15 are obtained from a tree with branch 
lengths ul,U 2 (using table IS^ . then these c, values automatically satisfy the invariants of 
the system, and (mi,M 2 ) = {ul,U 2 ) is necessarily one solution of the system of equations. 
Our hope is that this solution is unique. 

The question of identifiability can now be restated in terms of the number of algebraically 
independent equations that the system has. For the tree above, there are 15 original equa¬ 
tions but 13 independent invariants, therefore two algebraically independent equations. Since 
we only have 2 unknown parameters Ui and U 2 , we can solve for them. We can do a similar 
analysis with the network example in Fig. [S2]with CF equations in table EH We have a 
system of 15 equations, and we would like to know how many independent invariants are 
dehned by this system in order to determine how many algebraically independent equations 
we have. We know from algebraic geometry [19] that a system with the same number of al¬ 
gebraically independent equations as unknown parameters has hnitely many solutions. This 
does not imply that the parameters are identifiable, but it does imply that the parameters 
are generically indentifiable. It also implies that, given perfect data (infinitly many genes, 
correctly reconstructed gene trees), the pseudolikelihood has hnitely many maxima. Proving 
uniqueness of solution is a hard problem in algebraic geometry and is beyond the scope of 
the present work. 

Therefore, to study the generic identihability of parameters, we obtained the system of 
quartet CF equations for each network, and we verihed whether the number of algebraically 
independent equations was equal to the number of parameters. We automated this using 
Macaulay2. 

The network in Fig. EH needs four parameters ( 7 , to, ^ 1 , ^ii) and 15 equations. In this 
example, the 15 equations have 12 independent invariants. Thus, we only have three alge¬ 
braically independent equations and four parameters. This means that the system has an 
inhnite number of solutions and we cannot solve for 7 ,to,ti,tii. 

In this case, which is a bad diamond I, we decided to reparametrize ( 7 , to,^ 1 ,^ii)- The 
quartet CFs can be expressed in terms of the following 3 parameters only, which are identi¬ 
fiable from the 3 algebraically independent equations: x = 7(1 — exp(—to)), y = (1 — 7)(1 — 
exp(—ti)) and tn. 

They have the following interesting interpretation: the lineage from species B in Fig. EH 
(left) either originated from the hybridization edge and coalesced with C along the edge of 
length to with probability x = 7(1 — exp(—to)), or it originated from the other hybridization 
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edge and coalesced with A along the edge of length ti, with probability y = (1 — 7)(1 — 
exp(-ti)). 

C.2 n-taxon network: parameter identifiability when h = 1 

In this section we assume that the network has a single hybridization (assumption relaxed 
in the next section) and we seek to determine if the parameters around the cycle created 
by this hybridization are identihable (Fig. [S3]), given the network topology. Parameter 
identifiability depends on the number of nodes in the cycle created by the hybridization 
event. We summarize the results below. 

For k = 3, the parameters are not identifiable if n < 5. If n > 6 and rij > 2 for all 
i = 0,1,2 (see Fig. [S3|), we have 6 algebraically independent equations and 7 parameters. 
Thus, the 7 parameters are not identihable, but 6 of them are identihable if the remaining 
parameter if known. We decided to set tu = 0 and estimate the other 6 paramerers. We 
call this case a good triangle. 

For fc = 4, all parameters are identihable if either no > 2 (or n 2 , symmetrically), or if 
both ni and n^>2 (see Fig.jSSj). Parameters are not all identihable in the remaining 2 cases, 
which we call bad diamonds I and II (see Fig. [S3|). The bad diamond I was described in the 
previous section. For a bad diamond II, 6 parameters are needed around the hybridization 
and we found only 5 algebraically independent equations, with no simple reparametrization. 
So, we set fi 3 = 0 and solve for the remaining parameters. 

For fc = 5 all parameters are identihable. To show it, we hrst considered the case when 
no = ■ ■ ■ = ^4 = 1 , that is, we considered information from quartets with at most one taxon 
from each subnetwork. Using Macaulay2 we found that 7 and the 3 tree edge lengths in 
the cycle (toUiU 2 ) were identihable. Next, we used previous results from k = A. If n^ > 2 
and i ^ A, the length tu of the edge attaching subnetwork i is identihable because we can 
extract a good diamond from which we can identify tu. If n 4 > 2, the hybrid branch lengths 
to, ^4 and the subnetwork branch length ^44 are needed. To identify these parameters, we can 
extract a bad diamond II, which originally provided 6 algebraically independent equations. 
These are enough to identify the remaining 3 unknown branch lengths. 

For fc > 5 we can prove that all needed parameters are identihable by extracting sub¬ 
networks with k = 5, each identifying a diherent subset parameters, together spanning all 
parameters. 

Note that the branch lengths labelled in Fig. [S3] are not all needed for all networks. In 
particular, if rij = 1 then the branch leading to subnetwork i is an external branch, and its 
length tii is irrelevant for any CF. For a bad diamond I for example (rii > 2 but other n* = 1), 
Uo) ti 2 and tio are irrelevant and obviously non-identihable. In this bad diamond I, the hybrid 
branch lengths ^2 and to are also irrelevant, as they only have 1 descendant {no = 1 ) like 
external edges. These branch lengths were omitted from our study of identihability: we did 
not include them in the list of parameters to be studied and we excluded them from the list 
of parameters during the pseudolikelihood optimization search. 
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C.3 Parameter identifiability in level-1 networks 

We now extend the resnlts from the previous section to networks with h > 1, provided that 
the network is of level 1. As already noted in section [B.21 only a subset of quartets bear 
information on the hybridization of interest and on parameters around this hybridization; 
those involving rii <2 taxa from any given subnetwork i. Because the network is of level 1, 
any hybridization in subnetwork i does not affect the quartet CFs if Uj = 0 or Uj = 1. For 
4-taxon sets that need to involve n* = 2 taxa from subnetwork i, we noted in section IB.21 
that the subnetwork reduced to these 2 taxa was equivalent (in terms of quartet CFs) to 
a subnetwork with no hybridizations where the 2 taxa are sister (subnetwork of type 1 in 
section lA.2D . This may come at the cost of having to transform the length tu of the branch 
linking the cycle from the hybridization of interest to subnetwork i (Fig. [S3]). If there is at 
least one pair of taxa from subnetwork i such that the equivalent subnetwork reduced to this 
pair is separated from the cycle by a branch of untransformed length tu, then the results 
stated in the previous section apply directly. If there is no such pair of taxa from subnetwork 
i, then the results stated in the previous section apply to the transformed branch length tu 
instead. 

D Heuristic search in the space of networks 

The search is initialized with a user-specified network, which could be a tree obtained with a 
very fast quartet-based tree estimation method like ASTRAL [22]. If a tree topology is given 
with no branch lengths, those are initialized using the average observed CF of the quartets 
that span that branch exactly, CF, transformed to coalescent units by f = — log(l — 3/2 CF). 
By default, SNaQ performs 10 independent searches, each using a starting topology based on 
the user-specified network or tree. For each search, an NNI is performed on the user-specified 
topology with probability 0.7 by default, so that independent runs have different starting 
topologies. If the starting topology is a network, then with probability 0.7 by default the 
origin (or target) of a hybrid edge is moved. Each search then navigates the network space 
by altering the current network using one of 5 proposals, chosen at random: 

1 . Move the origin of an existing hybrid edge. A hybrid node is chosen uniformly at ran¬ 
dom, then one of the 2 parent edges of that node is chosen according to the inheritance 
probabilities: the edge with inheritance 7 is chosen with probability 1 — 7 . A new edge 
is then chosen at random from the vicinity of the current origin. This edge is cut into 
2 smaller edges to insert the new origin. The two branches at the old origin are merged 
along with their branch lengths, whose proportions are used for the 2 new branches 
created around the new origin. 

2. Move the target of an existing hybrid edge, similarly to the previous move. 

3. Change the direction of an existing hybrid edge. A hybrid node is chosen uniformly 
at random, then one of the 2 parent edges of that node is chosen according to the 
inheritance probabilities: the edge with inheritance 7 is chosen with probability 1 — 7 . 
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The direction of the chosen hybrid edge is flipped. The former hybrid node becomes 
a tree node and the other node attached to the hybrid edge becomes a hybrid node. 
Branch lengths are left unchanged. 

4. Perform a nearest-neighbor interchange (NNI) on a tree edge, around a tree edge 
chosen uniformly at random. [25] describe a similar type of NNI that yields a level-1 
(unrooted) network, and showed that the resulting network is level -1 if the chosen tree 
edge is not a link. Branch lengths are left unchanged. 

5. Add a hybridization if the current topology has h < hm- Two tree edges are chosen 
at random and a new hybrid edge is created between these 2 edges. The new nodes 
(origin and target) are placed uniformly along each chosen edge. 7 is drawn uniformly 
in (0,0.5). The new edge length is initialized at 0. If the new cycle intersects a 
previously existing cycle, the origin or the target is chosen at random, to be moved to 
a neighboring edge immediately. 


Any new proposed network is checked to verify that it is of level 1, with h < hm and with at 
least one valid placement for the root. If not, the move fails immediately and a new move is 
proposed at random. The search continues until one the following criterion is reached: 


• the absolute difference between the pseudolikelihood value of the newly accepted and 
the current network is smaller than a tolerance threshold, 0.001 by default. 

• the pseudo-deviance (difference between the theoretical maximum pseudolikelihood 
and the network pseudolikelihood) is below the tolerance threshold. 

• the number of failed moves reaches a limit, 100 by default. 


• for all move types, the number of failed attempts reaches a limit, which is specihc 
to the proposal type, n and h. This is to avoid repeated proposals of the same new 
network. For example, on a network with h = 1, there are at most A^i = 8 ways to 
move the origin of either hybrid edge to a neighboring location. For this network, we 
chose a limit of 28 failed moves of type 1 to ensure that all of the 8 distinct proposals 
were attempted with high probability, based on the theory of the coupon’s collector 
[50] . In general, for N = Ni distinct ways to attempt a move of type i, the upper 


N 


threshold was set to N - - 


2=1 


—N. We calculated W for each move type. In general 


there are A^i = A '"2 = 8h ways to move a hybrid origin or target and = 2h ways 
to change a hybrid edge direction. There are 2n — 3 internal tree edges on n taxa, so 
there are A 4 = 2n — 3 ways to propose an NNI and = {^"' 2 ^) ways to choose two 
tree edges to add a new hybridization. 
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E Identifiability from quartets versus triples 


A pseudolikelihood based on rooted triples is used in [17] to estimate a rooted network, 
whereas we use unrooted quartets to estimate a semi-directed network. We show here that 


unrooted quartets provide more information 
hable with rooted triples. For example, 071 
identihable by their set of triples (see Fig. 
identihable from the set of quartets. 



A B C D E 

(«) I'l 


to identify networks, that may not be identi- 
present two networks \ki and \[^2 that are not 
[54]) . We show here that these networks are 



Figure S4: Example from |47| of networks non-identihable with rooted triples from the 4 
ingroup taxa {A, B,C, D). These two networks are identihable from each other based on 
unrooted quartets from all 5 taxa. 


Rooted triples 
Unrooted quartets 


ABC ACD ABD BCD 
ABCE ACDE ABDE BCDE ABCD 


Table S3: The collection of 3-taxon subsets from a 4-taxon rooted network corresponds in 
part to the collection of 4-taxon subsets from an unrooted network on the same 4 taxa 
(A,B,C,D) plus an outgroup (E). Extra subsets are obtained by sampling ingroup taxa only, 
like ABCD. 


A rooted triple is equivalent to an unrooted quartet by adding the outgroup used to root 
the triple, if gene trees were initially rooted using an outgroup. If gene trees were rooted 
using a molecular clock assumption or midpoint rooting, for example, the root still implies 
an outgroup shared by all gene trees. Thus, the full set of rooted triples correspond to the 
unrooted quartets on any 3 ingroup taxa and the outgroup (see table [S3|). With 4 ingroup 
taxa, the collection of 4-taxon subsets has one extra set that provides enough information 
to distinguish between the two networks in Fig. EH Fig. EH corresponds to the networks 
Ti and 4/2 in [17| with an added outgroup E. As mentioned in previous sections, the 15 CF 
equations are not independent as they need to add up to one for any given 4-taxon subset 
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(or 3-taxon subset) and some subsets are of type 1 so the two minor CFs are equal. The 
independent formulas for the CFs from with added outgroup E are: 


CFbciae = (1 - /5)(1 - 2/3e-'3) + /31/3e-'^ 

CFbaice = (1 - /9)l/3 + /3(1 - 2/3 

CFbe\cd = 2/3 e-'i) + /3{1 - 2/3 

CFAE\CD = l-‘2/3e-‘^-^^ 

CFab\cd = (1 - /9)(1 - 2/3 e-'i) + /3(1 - 2/3 

whereas the formulas from T 2 are 

CFbciae = a(l - 2/3 e"'!) + (1 - a)l/3e-'^ 

CFba\ce = al/3e-'’i -t- (1 - a)(l - 2/3 e-^^) 

CFbe\cd = a^(l - 2/3 + 2 a(l - a)(l - + 1/3 + (1 - af{l - 2/3 

CFae\cd = a^(l - 2/3 6 -^"-'’®-'’!) + 2 a(l - a)(l - e"'’" + 1/3 + (1 - af{l - 2/3 

CFab\cd = «^(1 - 2/3 + 2 a(l - a)(l - e"'’" + l/3e“^""'’i) + (1 - af{l - 2/3 

If we set 62 = 1,&3 = 2,64 = 1,65 = 0,61 = l,a = 0.1,/3 = 0.663163 ,/2 = 1.951019,/g = 

0.207841,/i = 1.841435 as in jlTj, we see that the first 4 CFs are identical between Ti and 
\[' 2 , but the fifth one (corresponding to the subset without the outgroup, not present in the 
triples) differs between Ti and 4 / 2 , allowing us to distinguish between both networks. 

CFbc\ae{'^i) = CFbc\ae{^2) = 0.18584 
CFba\ce{^i) = CFba\ce{^2) = 0.69153 
CFb_e|cd(4^i) = CFbe\cd{^2) = 0.90743 
CFae\cd{^i) = CFae\cd{^2) = 0.914114 
CFab\cd{^i) = 0.92956 ^ CFab\cd{^2) = 0.956292. 
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F Simulated data 

Gene trees were simulated on the networks shown below. 


1.5 




6 

■5 

4 

■3 

2 


(a) n = 6 taxa, h = l hybridization (fc = 4) (b) n = 6 taxa, h = 2 hybridizations (fc = 4,4) 



(c) n = 10 taxa, h = 2 hybridizations {k = 4, 7) (d) n = 15 taxa, h = 3 hybridizations {k = 4, 5,6) 


Figure S5: True networks used for the simulations, including a bad diamond I (top right, 
n = 6 , with 7 = 0.2) and a bad diamond II (bottom left, n = 10, with 7 = 0.3). Branch 
lengths are in coalescent units. 
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G Xiphophorus fish network analysis 



Figure S6: Pseudo-deviance score vs number of hybridizations for the Xiphophorus hsh data 
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X. meyeri 



(b) = 1 


X. malinche X. birchmanni 



(c) h = 2 



X. meyeri X. gordoni 


(d) h = 3 
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{e) h = 4: {i) h = 5 


Figure S7: Estimated networks for the Xiphophorus fish data for h = 0 to 5. The estimated 
tree (h = 0) is rooted with the southern swordtail (SS) outgroup clade (pink). Networks with 
h > 1 are shown as semi-directed networks, as estimated. Hybrid edges, shown in green, are 
directed toward their hybrid node, which is the node where the two hybrid edges meet and 
represents an ancestral species of mixed parental origins. The minor hybrid edge is thinner 
and annotated with the inheritance probability ( 7 ) whereas the major hybrid edge is thicker 
and its inheritance probability is given by 1 — 7 (not shown). With h > 3, the direction of 
one hybridization in the SS clade conflicts with placing the root at the base of this clade (see 
main text). In the estimated network with h = 2 and 3, numbers in green represent bootstrap 
support for a given hybridization. In particular, the 75% for h = 3 (or 20% for h = 2) support 
value refers to the full hybridization cycle, including the placement of X. nezahuacoyotl as 
sister to X. montezumae in the major tree. X. nezahuacoyotl was placed differently (sister 
to the clade X. cortezi+X. hirchmanni+X. malinche) in 3 bootstrap networks, for which the 
inferred reticulation still had X. nezahuacoyotl as the recipient lineage. It also had the same 
donor lineage as shown in (c) or in (d). 
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